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Abstract: A multitude of measures have been proposed to quantify the similarity 
between protein 3-D structure. Among these measures, contact map overlap (CMO) 
maximization deserved sustained attention during past decade because it offers a fine 
estimation of the natural homology relation between proteins. Despite this large in- 
volvement of the bioinformatics and computer science community, the performance of 
known algorithms remains modest. Due to the complexity of the problem, they got 
stuck on relatively small instances and are not applicable for large scale comparison. 

This paper offers a clear improvement over past methods in this respect. We present 
a new integer programming model for CMO and propose an exact B&B algorithm with 
bounds computed by solving Lagrangian relaxation. The efficiency of the approach is 
demonstrated on a popular small benchmark (Skolnick set, 40 domains). On this set our 
algorithm significantly outperforms the best existing exact algorithms, and yet provides 
lower and upper bounds of better quality. Some hard CMO instances have been solved 
for the first time and within reasonable time limits. From the values of the running 
time and the relative gap (relative difference between upper and lower bounds), we 
obtained the right classification for this test. These encouraging result led us to design 
a harder benchmark to better assess the classification capability of our approach. We 
constructed a large scale set of 300 protein domains (a subset of ASTRAL database) 
that we have called Proteus_300. Using the relative gap of any of the 44850 couples as 
a similarity measure, we obtained a classification in very good agreement with SCOP. 
Our algorithm provides thus a powerful classification tool for large structure databases. 

Key-words: Protein structure alignment. Contact Map Overlap maximization, com- 
binatorial optimization, integer programming, branch and bound, Lagrangian relax- 
ation. 
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Vers une classification structurelle des proteines basee 
sur le recouvrement des cartes de contacts 



Resume : Une multitude de mesures ont ete proposees pour quantifier la simila- 
rite entre les structures 3D de proteines. Parmis ces mesures, la maximisation du re- 
couvrement des cartes de contacts ("Contact Map Overlap Maximization", CMO) a 
re9u durant les dix dernieres annees une attention soutenue, car elle permet une bonne 
estimation des relations naturelles d'homologie entre proteines. Cependant, malgre 
r implication des communautes de bio-informatique et de sciences computationnelles, 
les performances des algorithmes connus restent modestes. A cause de la complexity 
du probleme, ces algorithmes sont Umites a de petites instances et ne sont pas appli- 
cables pour des comparaions a grandes echelles. 

Ce rapport marque une nette amelioration sur ce point par rapport aux methodes 
precedentes. Nous presentons un nouveau modele de programmation lineaire en nombre 
entier pour CMO, et nous proposons un algorithme exact de separation et evaluation 
dont les bornes proviennent de la relaxation lagrangienne de notre modele. L'efficacite 
de cette approche est demontree sur un petit ensemble de test connu (1' ensemble de 
skolnick, 40 domaines). Sur ce jeu de test, notre algorithme surpasse en rapidite 
d' execution les meilleurs algorithmes existants tout en obtenant des bornes de meilleurs 
qualite. Quelques instances difficiles de CMO ont ete resolues pour la premiere fois, 
et ce en des temps raisonnnables. A partir des valeurs de temps de calculs et de "gaps" 
relatifs (la difference relative entre la borne superieur et inferieure), nous avons obtenu 
la borme classification de 1' ensemble de skolnick. Ces resultats encourageants nous ont 
pousses a creer un jeu de test plus difficile pour confirmer les capacites de classification 
de notre approche. Nous avons construit un ensemble de test contenant 300 domaines 
de proteines (un sous-ensemble d' ASTRAL) que nous avons appele Proteus_300. En 
utilisant le gap relatif des 44850 couples comme une mesure de similarite, nous avons 
obtenu une classification en tres bon accord avec SCOP. Notre algorithme offre done 
un outil puissant pour la classification de grandes bases de donnees de structures. 

Mots-cles : AUgnement de structures de proteines, maximisation du recouvrement 
de cartes de contacts, optimisation combinatoire, programmation Uneaire en nombre 
entier, separation et evaluation, relaxation lagrangienne. 
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1 Introduction 

A fruitful assumption of molecular biology is that proteins sharing close three-dimensional 
(3D) structures are likely to share a common function and in most case derive from a 
same ancestor. Computing the similarity between two protein structures is therefore a 
crucial task and has been extensively investigated iH [I4l [TS] l22ll . Interested reader 
can also refers i6l|71[8l|9l[l0l[n][l2l[l8l. Since it is not clear what quantitative 
measure to use for comparing protein structures, a multitude of measures have been 
proposed. Each measure aims in capturing the intuitive notion of similarity. We stud- 
ied the contact-map-overlap (CMO) maximization, a scoring scheme first proposed in 
lfT6l . This measure has been found to be very useful for estimating protein similarity 
- it is robust, takes partial matching into account, translation invariant and captures 
the intuitive notion of similarity very well. The protein's primary sequence is usually 
thought as composed of residues. Under specific physiological conditions, the linear 
arrangement of residues will fold and adopt a complex 3D shape, called native state 
(or tertiary structure). In its native state, residues that are far away along the linear 
arrangement may come into proximity in 3D space. The proximity relation is captured 
by a contact map. Formally, a map is specified by a — 1 symmetric squared matrix 
C where c,j = 1 if the Euclidean distance of two heavy atoms (or the minimum dis- 
tance between any two atoms belonging to those residues) from the /-th and the y-th 
amino acid of a protein is smaller than a given threshold in the protein native fold. In 
the CMO approach one tries to evaluate the similarity of two proteins by determining 
the maximum overlap (also called alignment) of contacts map. Formally: given two 
adjacency matrices, find two sub-matrices that correspond to principle minor^ having 
the maximum inner product if thought as vectors (i.e. maximizing the number of 1 on 
the same position). 

The counterpart of the CMO problem in the graph theory is the well known maxi- 
mum common subgraph problem (MCS) |17|. The bad news for the later is its APX- 
hardnesfl The only difference between the above defined CMO and MCS is that the 
isomorphism used for the MCS is not restricted to the non-crossing matching only. 

' matrix that coiTesponds to a principle minor is a sub-matrix of a squared matrix obtained by deleting k 
rows and the same k columns 

^see "A compendium of NP optimization problems", http://www.nada.kth.se/~viggo/problemlist/ 
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Nevertheless the CMO is also known to be NP-hard ifTSll . Thus the problem of design- 
ing efficient algorithms that guarantee the CMO quality is an important one that has 
eluded researchers so far The most promising approach for solving CMO seems to be 
integer programming coupled with either Lagrangian relaxation |5 1 or B&B reduction 
technique ||2TI . 

The results in this paper confirm once more the superiority of Lagrangian relaxation 
to CMO since the algorithm we present belongs to the same class. Our interest in CMO 
was provoked by its similarity with the protein threading problem. For the later we have 
presented an approach based on the so called non-crossing matching in bipartite graphs 
01. It yielded a highly efficient algorithm solving the PTP by using the Lagrangian 
duality a EH. 

The contributions of this paper are as follows. We propose a new integer program- 
ming formulation of the CMO problem. For this model, we design a B&B algorithm 
coupled with a new Lagrangian relaxation for bounds computing. We compare our ap- 
proach with the best existing exact algorithms f5','2TI| on a widely used benchmark (the 
Skolnick set), and we noticed that it outperforms them significantly. New hard Skolnick 
set instances have been solved. In addition, we observed that our Lagrangian approach 
produces upper and lower bounds of better quality than in |5, 21 1. This suggested us 
to use the relative gap (a function of these two bounds) as a similarity measure. To 
the best of our knowledge we are the first ones to propose such criterion for similar- 
ity. Our results demonstrated the very good classification potential of our method. Its 
capacity as classifier was further tested on the Proteus_300 set, a large benchmark of 
300 domains that we extracted from ASTRAL -40 |23|. We are not aware of any pre- 
vious attempt to use a CMO tool on such large database. The obtained classification is 
in very good agreement with SCOP classification. This clearly demonstrates that our 
algorithm can be used as a tool for large scale classification. 

2 The mathematical model 

We are going to present the CMO problem as a matching problem in a bipartite graph, 
which in turn will be posed as a longest augmented path problem in a structured graph. 
Toward this end we need to introduce few notations as follows. The contacts maps of 
two proteins P 1 and P2 are given by graphs Gm = (V™ ,£,„) with V,,, = { 1 , 2 n„, } for 
m= 1,2. The vertices Vm are better seen as ordered points on a line and correspond to 
the residues of the proteins. The arcs (/, j) correspond to the contacts. The right and left 
neighbouring of node / are elements of the sets 5,^(/) = {j\j > i, (/, j) e E,n}, 5^(0 = 
< '7 UJ) 6 Em}. Let / G Vi be matched with A: G V2 and j G Vi be matched with 
/ G ¥2- We will call a matching non-crossing, if / < j implies k<l. A feasible alignment 
of two proteins P\ and Pi is given by a non-crossing matching in the complete bipartite 
graph B with a vertex set Vi U V2. 

Let the weight Wikji of the matching couple {i,k){j,l) be set as follows 



For a given non-crossing matching M in B we define its weight w{M) as a sum over all 
couples of edges in M. The CMO problem consists then in maximizing w{M), where 
M belongs to the set of all non-crossing matching in B. 

In |[T] |2] El m we have already dealt with non-crossing matching and we have pro- 
posed a network flow presentation of similar one-to-one mappings (in fact the mapping 




(1) 
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there was many-to-one). The adaptation of this approach to CMO is as follows: The 
edges of the bipartite graph B are mapped to the points of ni x nx rectangular grid 
B' = {V',E') according to: point - {i,k) e V' < — > edge - {i,k) in B. 

Definition. The feasible patli is an arbitrary sequence {ii,ki),{i2,k2),. . . ,{ii,kt) 
of points in B' such that ij < ij^i and kj < kj+\ for j = l,2,...,f — 1. 

The correspondence feasible path ^ non-crossing matching is obvious. This way 
non-crossing matching problems are converted to problems on feasible paths. We also 
add arcs ~> (y, I) G E' iff Witji = 1. In B' , solving CMO corresponds to finding the 
densest (in terms of arcs) subgraph of B' whose node set is a feasible path (see Fig.[T]|. 



VI 




Figure 1: Left: Vertex 1 from VI is matched with vertex 1 from V2 and 2 is matched 
with 3: matching couple (1, 1)(2,3). Other matching couples are (3,4)(5,5). This de- 
fines a feasible matchingM = {(1,1)(2,3),(3,4)(5,5)} with weight w(M) =2. Right: 
The same matching is visualized in graph B' . 

To each node (/ , k) G V' we associate now a 0/ 1 variable x/^., and to each arc (/, k) 
{j, I) S £■', a 0/1 variable yikji- Denote by X the set of feasible paths. The problem can 
now be stated as follows (see Fig. |2]a) for illustration) 

v(CMO)= max £ yttji (2) 
iik)Ul)eE' 



subject to 



/e8+(i) ' ' ' 

^ v"* ■ Q — / ■\ / = 2,3,...,nl, 

xik> Z y.nik, ;e8iW ;t = 2,3,...,«2 



/G82 {k) 



iG5+(/) ' 

xex (7) 
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Actually, we know how to represent X with linear constraints. RecalUng the defi- 
nition of feasible path, (|7]i is equivalent to 

k i-l 

^jc,7 + ^jcyA; < 1, ! = l,2,...,nl, k=l,2,...,n2. (8) 

1=1 i=i 

We recall that from the definition of the feasible paths in B' (non-crossing matching 
in B) the j-th residue from PI could be matched with at most one residue from P2 and 
vice-versa. This explains the sums into right hand side of (O and (|5]l - for arcs having 
their tails at vertex {i,k); and (Hji and (|6]l- for arcs heading to {i,k). Any {i,k){j,l) 
arc can be activated (yjkji = 1) iff xn^ ~ 1 and xji — 1 and in this case the respective 
constraints are active because of the objective function. 

A tighter description of the poly tope defined by (l3]l-(|6]l and < x,^ < 1, < yikji 
could be obtained by lifting the constraints (|4|i and ^ as it is shown in Fig. |2]b). 
The points shown are just the predecessors of (/, k) in graph B' and they form a grid 
of 5j"(!) rows and (^) columns. Let 11,12, ■ ■ ■ ,is be all the vertices in ?>^{i) ordered 
according the numbering of the vertices in Vi and likewise ki,k2, ... ,k, 11162 (k). Then 
the vertices in the Z-th column {ii,ki), {12, ki), . . . {is,ki) correspond to pairwise crossing 
matching and at most one of them could be chosen in any feasible solution x G X (see 
This "all crossing" property will stay even if we add to this set the following 
two sets: (/i,^2), ■ • • , and {is,ki+i),{is,ki+2), ■ ■ ■ ,{h,kt). Denote by 

colik{l) the union of these three sets and analogously by roWik{j) the corresponding 
union for the j-th row of the grid. When the grid is one column/row only the set 
rowik{j)lcolik{l) is empty. 

Now a tighter LP relaxation of dSj-® is obtained by changing (|4|i with 



(r,s)eroWik(j) 



and ^ with 



(r,s)G«)/,t(/) 



Remark: Since we are going to apply the Lagrangian technique there is no need 
neither for an explicit description of the set X neither for lifting the constraints © Q. 



3 Lagrangian relaxation approach 

Here, we show how the Lagrangian relaxation of constraints (|9]l and ( fTOl i leads to an 
efficiently solvable problem, yielding upper and lower bounds that are generally better 
than those found by the best known exact algorithm [5 1. 

Let Tj^i^j > (respectively X'^/^j > 0) be a Lagrangian multiplier assigned to each con- 
straint (|9|l (respectively (fTOll). By adding the slacks of these constraints to the objective 
function with weights X, we obtain the Lagrangian relaxation of the CMO problem 

LR{X)= max ^ X'-i^j{xik- ^ yrsik) 

i,k.je5-[{i) (r,s)erowu,{j) ^^^^ 

+ E ^lli^ik- E yr.nk)+ E ^'iX-;/ 

subject to X 6 X, (O, (|5ll and y>Q. 
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Figure 2: The shadowed area represents the set of vertices in V' which are tails for the 
arcs heading to In a): T corresponds to the indices of yjni^ in ^ for I fixed. 

O corresponds to the indices of yjni^ in (|4]i for j fixed. In b): T corresponds to the 
indices of yjni^ in ( fTOl i for I fixed (the set coliii{l)). Q corresponds to the indices of yjni^ 
in (|9]l for j fixed (the set rowik{j))- 



Proposition 1 LR{X) can be solved in 0{\V'\ + \E'\) time. 

Proof: For each (/, k) G V ', if = 1 then the optimal choice yttji amounts to solving 
the following : The heads of all arcs in E' outgoing from form a |5+(/)| x |5+(A;)| 
table. To each point {j,l) in this table, we assign the profit max{0,c,xj7(A-) }, where 
CikjiiX) is the coefficient of yniji in (fTTT i. Each vertex in this table is a head of an arc 
outgoing from {i,k). Then the subproblem we need to solve consists in finding a subset 
of these arcs having a maximal sum c,a:(A-) of profits(the arcs of negative weight are ex- 
cluded as a candidates for the optimal solution) and such that their heads lay on a feasi- 
ble path. This could be done by a dynamic programming approach in (9 ( 1 5^ (/) 1 1 5^ (A:) | ) 
time. Once profits c,jt(A,) have been computed for all {i,k) we can find the optimal so- 
lution to LR(X) by using the same DP algorithm but this time on the table of nl x n2 
points with profits for {i,k)-th one given by 

cd^)+ L 4./+ E ^u- (12) 
;g5i-(,-) ieb:;{k) 

where the last two terms are the coefficients of Xj/^ in (fTTT i. 

Remark: The inclusion jc e X is explicitly incorporated in the DP algorithm. 

3.1 The algorithm 

In order to find the tightest upper bound on v{CMO) (or eventually to solve the prob- 
lem), we need to solve in the dual space of the Lagrangian multipliers LD = minx^Q LR(k), 
whereas LR{X) is a problem in x,y. A number of methods have been proposed to 
solve Lagrangian duals: subgradient method, dual ascent methods,constraint genera- 
tion method, column generation, bundel methods, augmented Lagrangian methods, etc. 
Here, we choose the subgradient method. It is an iterative method in which at iteration 
t, given the current multiplier vector A-', a step is taken along a subgradient of LR{X), 
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then if necessary, the resuhing point is projected onto the nonnegative orthant. It is 
well known that practical convergence of the subgradient method is unpredictable. For 
some problems, convergence is quick and fairly reliable, while other problems tend to 
produce erratic behavior of the multiplier sequence, or the Lagrangian value, or both. 
In a "good" case, one usually observe a saw-tooth pattern in the Lagrangian value for 
the first iterations, followed by a roughly monotonic improvement and asymptotic con- 
vergence to a value that is hopefully the optimal Lagrangian bound. The computational 
runs on a reach set of real-life instances confirm a "good" case belonging of our ap- 
proach at some expense in the speed of the convergence. 

In our realization, the update scheme for Xn^j is A-^^^' — max{0,X;^,^ — ^'s'ikj} S 
where g'-i^j — x,i — Y^yjiik (see (|9]l and (fTOl i for the sum definition) is the sub-gradient 
component (0, l,or —1), calculated on the optimal solution ic, y of LR(k') . The step size 
& is 0' = . , .2 " ,2 where Z/^ is a known lower bound for the CMO problem 

and a is an input parameter. Into this approach the A:-components of LRiV) solution 
provides a feasible solution to CMO and thus a lower bound also. The best one (incum- 
bent) so far obtained is used for fathoming the nodes whose upper bound falls below 
the incumbent and also in section |4]for reporting the final gap. If LD < v{CMO) then 
the problem is solved. If LD > v{CMO) holds, in order to obtain the optimal solution, 
one could pass to a branch&bound algorithm suitably tailored for such an upper bounds 
generator. 

From among various possible nodes splitting rules, the one shown in Fig. [3] gives 
quite satisfactory results (see section |4]i. Formally, let the current node be a sub- 
problem of CMO defined over the vertices of V' falling in the interval [lc{k),uc{k)] 
for k ^ l,n2 (in Fig. [3] these are the points in-between two broken lines (the white 
area). Let {rowbest ,colbest) be the argmaxmin(5„(!,A:),5;/(/,A;)), where Sd{i,k) = 
^y<^max(Mc(y) — i,0) and Su{i,k) = ^y>^.max(/ — lc{j),Q). Now, the two descen- 
dants of the current node are obtained by discarding from its feasible set the vertices 
in Sd{rowbest ,colbest) and Su{rowbest ,colbest) respectively. The goal of this strategy 
is twofold: to create descendants that are balanced in sense of feasible set size and to 
reduce maximally the parent node's feasible set. 

In addition, the following heuristics happened to be very effective during the tra- 
verse of the B&B tree nodes. Once the lower and the upper bound are found at the root 
node, an attempt to improve the lower bound is realized as follows. 

Let ,ki),{iii^,k2), ■ . . , (Uji^i) be an arbitrary feasible path which activates cer- 
tain number of arcs (recall that each iteration in the sub-gradient optimization phase 
generates such path and lower bound as well). 

Then for a given strip size sz (an input parameter set by default to 4), the matchings 
in the original CMO are restricted to fall in a neighborhood of this path, allowing xnc to 
be non zero only for 

max{ 1 , ij — sz} < i < min{n + sz},j ^ ki,k2, ■ ■ ■ ,ks. 

The Lagrangian dual of this subproblem is solved and a better lower bound is pos- 
sibly sought. If the bound improves the incumbent, the same procedure is repeated by 
changing the strip alongside the new feasible solution. 

Finally, the main steps of the B&B algorithm are as follows: 
Initialization: Set L= { original CMO problem, i.e. no restrictions on the feasible paths } . 
Problem selection and relaxation: Select and delete the problem P' from L having the 

^analogously for Xnii 
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Figure 3: Sketch of the B&B sphtting strategy. a) the white area in-between 
broken lines represents the current node feasible set; b) This set is split by 
(rowbest ,colbest), D corresponds to the set Sd{rowbest ,colbest) while U corresponds 
to the set Sii{rowbest ,colbest); c) and d) are the descendants of the node a). 

biggest upper bound. Solve the Lagrangian dual of P'. (Here a repetitive call to a 
heuristics is included after each improvement on the lower bound). 
Fathoming and Pruning: Follow classical rules. 

Partitioning : Create two descendants of P' using (rowbest, colbest) and add them to L. 
Termination : if L = 0, the solution {x*,y*) yielding the objective value is optimal. 

4 Numerical results 

To evaluate the above algorithm we performed two kinds of experiments. In the first 
one we compared our approach with the best existing algorithm from literature Q in 
term of performance and quality of the bounds. This comparison was done on a set of 
proteins suggested by Jeffrey Skolnick which was used in various recent papers related 
to protein structure comparison Js] [Ts] ET] . This set contains 40 medium size domains 
from 33 proteins, which number of residues varies from 95 (2b3iA) to 252 (law2A). 
The maximum number of contacts is 593 (IbtmA). We afterwards experimentally 
evaluated the capability of our algorithm to perform as classifier on the Proteus_300 set, 
a significantly larger protein set. It contains 300 domains, which number of residues 
varies from 64 (dl5bba_) to 455 (dlpo5a_). Its maximum number of contact is 1761 
(dli24a_). We will soon make available all data and result^ on the URL: 
http://www.irisa.fr/symbiose/softwares/resources/proteus300 

4.1 Performance and quality of bounds 

The results presented in this section were obtained on machines with AMD Opteron(TM) 
CPU at 2.4 GHz, 4 Gb Ram, RedHat 9 Linux. The algorithm was implemented in C. 
According to SCOP classificatior0 |19|, the Skolnick set contains five families (see 
Table |2] in Annexe jl. Note that both approaches that we compare use different La- 

solved instances, upper and lower bounds, computational time, classifications... 
^Using SCOP version 1.71 

'Caprara et al. |5 | mention only four families. This wrong classification was also accepted in 1181 but 
not in 1211 . The families are in fact five as shown in Table |2] According to SCOP classification the protein 
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grangian relaxations. Our algorithm is called a_purv6Q, while the other Lagrangian 
algorithm is denoted by LR. 

The Skolnick set requires aligning 780 pairs of domains. We bounded the execution 
time to 1800 seconds for both algorithms. a_purva succeeded to solve 171 couples in 
the given time, while LR solved only 157 couples. Note that another exact algorithm 
called CMOS has been proposed in a very recent paper f2T\. CMOS succeeded to solve 
only 161 instances from the Skolnick set, yet the time limit was 4 hours on a similar 
workstation. Hence it seems that 171 is the best score ever obtained when exactly 
solving Skolnick set. To the best of our knowledge, we are the first ones to solve all 
the 164 instances with couples from the same SCOP folds, as well as the first to solve 
instances with couples from different folds (the 7 instances of the 6''' class presented 
in Table [T]). The interested reader can find our detailed results on the webpage cited 
before. 

Figure m illustrates LR/a_purva time ratio as a function of solved instances. It is 
easily seen that a_purva is significantly faster than LR (up to several hundred times 
in the majority of cases). Table [T| in the Annexe contains more details concerning a 
subset of 164 pairs of proteins. We observed that this set is a very interesting one. It is 
characterized by the following properties: a) in all but the 6 last instances the a_purva 
running time is less than 10 seconds; b) in all instances the relative gaf0 at the root 
of the B&B is smaller than 4, while in all other instances this gap is much larger 
(greater than 18 even for couples solved in less than 1800 sec); c) this set contains 
all instances such that both proteins belong to the same family according to SCOP 
classification. In other words, each pair such that both proteins belong to the same 
family is an easily solvable instance for a_purva and this feature can be successfully 
used as a discriminator In fact, by virtue of this relation we were able to correctly 
classify the 40 items in the Skolnick set in 2000 seconds overall running time for all 
780 instances. We will go back over this point in the next section. 



ne ratio 




80 100 
Solved instances. 



Figure 4: — LR tirne — i-g^[Q ^ function of solved instances 
° a_purva time 



Iml does not belong to the first family as indicated in 151. Note that this corroborates the results obtained in 
(5] but the authors considered it as a mistake. 

^Apurva (Sanskrit) = not having existed before, unknown, wonderful, ... 

^We define the relative gap as 100 x "^g" . 
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Our next observation (see Fig. |5]and Fig. |6]in the Annexe) concerns the quality 
of gaps obtained by both algorithms on the set of unsolved instances. Remember that 
when a Lagrangian algorithm stops because of time limit (1800 sec. in our case) it 
provides two bounds: one upper (UB), and one lower (LB). Providing these bounds 
is a real advantage of a B&B type algorithm compared to any meta-heuristics. These 
values can be used as a measure for how far is the optimization process from finding 
the exact optimum. The value UB-LB is usually called absolute gap. Any one of the 
609 points {x^y) in Fig. |5]presents the absolute gap for a_purva {x coordinate) and 
for LR (y coordinate) algorithm. All points are above the y ~ x line (i.e. the absolute 
gap for a_purva is always smaller than the absolute gap for LR). On the other hand the 
entire figure is very asymmetric in a profit of our algorithm since its maximal absolute 
gap is 33, while it is 183 for LR. 

In Fig.|6]we similarly compare lower and upper bounds separately. Any point o has 
the lower bound computed by a_purva (res. LR) as x (res. y) coordinate, while any 
point X has the upper bound computed by a_purva (res. LR) as x (res. y) coordinate. 
We observe that in a large majority the points o are below the y —x line while the points 
X are above this line. This means that usually a_purva lowers bounds are higher, while 
its upper bounds are all smaller and therefore a_purva provides bounds with clearly 
better quality than LR. We don't have much information about the bounds find by CMOS, 
except that at the root of the B&B tree, it obtains upper bounds of worst quality than 
the ones of LR. 



absolute gaps 




100 

gaps for a_purva algorithrr 



Figure 5: Comparing absolute gaps on the set of unsolved instances. The gaps com- 
puted by a_purva are significantly smaller. 



4.2 A_purva as a classifier 

When running a_purva on the Skolnick set, we observed that relative gaps are smaller 
for similar domains than for dissimilar ones. This became even more obvious when we 
fixed a small upper bound of iterations and limited the computations only to the root 
of the B&B tree. The question then was to check if the relative gap can be used as a 
similarity index (the smaller is the relative gap, the more similar are the domains) which 
can be given to an automatic classifier in order to quickly provide a classification. 
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We used the following protocol : the runs of a_purva were limited to the root, with 
a limit of 500 iterations for the subgradient descent. We used the publicly available hi- 
erarchical ascendant classifier Chavl 120], which proposes a best partition of classified 
elements based on the derivative of the similarity index and thus requires no similarity 
threshold. For the Skolnick set, the alignment of all couples was done in less than 11 00 
seconds (with a mean computation time of 1.39 seconds/couple). The classification re- 
turned by Chavl based on the relative gap is exactly the classification at the fold level 
in SCOP. Taking into account that according to Table [T] 609 couples ran 1800 seconds 
without finding the solution, this result pushes to use the relative gap as a classifier 
Note also that we succeeded to classify the Skolnick set significantly faster than both 
previously published exact algorithms 1211 [5l that use similarity indexes based on lower 
bound only. This illustrates the effectiveness of using a similarity based on both upper 
and lower bounds. 

To get a stronger confirmation of a_purva classifier capabilities, we performed 
the same operation on the Proteus_300 set, presented in Table |3] The alignmenj^ of 
the 44850 couples required roughly 82 hours (with a mean computation time of 6,58 
seconds/couple). 

Table |4]presents the classification that we obtain. It contains 25 classes denoted by 
letters A- Y. This classification is almost identical to the SCOP one (at folds level) which 
contains 24 classes denoted by numbers (presented in Table O. 18 of the 24 SCOP 
classes correspond perfectly to our classes. Class 15 (resp. 24) contains two familiej*! 
that we classified in M and N (resp. V and W). Classes 9 and 1 1 were merged into 

class 1 and are indeed similar, with some domains (like dljgca_ and dlbOb ) having 

more than 75% of common contact^^- Class 18 was split into its two families (X and 
Y), but Y was merged with class 10. Again, some of the corresponding domains (e.g. 
dlbOOa_ and dlwbla4) are very similar, with more than 75% of common contacts. 



5 Conclusion 

In this paper, we give an efficient exact B&B algorithm for contact map overlap prob- 
lem. The bounds are found by using Lagrangian relaxation and the dual problem is 
solved by sub-gradient approach. The efficiency of the algorithm is demonstrated on a 
benchmark set of 40 domains and the dominance over the existing algorithms is total. 
In addition,its capacity as classifier (and this was the primary goal) was tested on a large 
data set of 300 protein domains. We were able to obtain in a short time a classification 
in very good agreement to the well known SCOP database. 

We are curently working on the integration of biological information into the con- 
tact maps, such as the secondary structure type of the residues (alpha helix or beta 
strand). Aligning only residues from the same type will reduce the research space and 
thus speed up the algorithm. 

'Detailed results of the runs will be available in our web page, 
'"in the SCOP classification. Families are sub-sub-classes of Folds. 

"The percentage of common contacts between domains i and j is j^^^^^'c ) where C, (resp Cj) denotes 
the number of contacts in domain (' (resp j), and CMO{i,j) is the number of common contacts between i and 
j found by a_purva. 
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class contains the hardest solved Skolnick set intstances. Column two(six) contains the 
names of the couples, column thi'ee(seven) is the score, column four(height) gives the time 
in seconds taken by LR algoritm, and column five(nine) presents the corresponding time 
taken by a_purva. 





Fold 


Family 


Proteins 


1 


Flavodoxin-like 


CheY-related 


IbOO, Idbw, Inat, Intr, 
lqmp(A,B,C,D), 3chy, 4tmy(A,B) 


2 


Cupredoxin-like 


Plastocyanin/ 
azurin-like 


Ibaw, lbyo(A,B), Ikdi, Inin, Ipla 
2b3i, 2pcy, 2plt 


3 


TIM beta/alpha- 
barrel 


Triosephosphate 
isomerase (TIM) 


lamk, law2, lb9b, Ibtm, Ihti 
Itmh, Itre, Itri, lydv, 3ypi, 8tim 


4 


Ferri tin-like 


Ferritin 


lb71, Ibcf, Idps, Ihfa, lier, Ircd 


5 


Microbial 
ribonucleases 


Fungal 
ribonucleases 


lrnl(A,B,C) 



Table 2: The Skolnick set 



Fold number 


SCOP fold 


SCOP family 




1 


7-bladed beta-propeller 


WD40-rcpt;al 


d 1 iirOa 1 . d 1 ncxbl. d 1 k8kc_, d Ip22a2. d 1 erja_ 
d 1 lbga_, d lpguu2. d lgxra_, d 1 pgiiul , d 1 nK);i2 
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2 


Acyl-CoA N-acyltransf erases 

(NAT) 


N- acetyl transferase, 
NAT 


d 1 nsla_. d lqsla_. d 1 vhsa_, d ls3za_, d 1 n7 1 a_ 
dltiqa_, dlq2ya_, dlghea_, dlufha_, dlvkca_ 


3 


Beta-Grasp 

(ubiquilin-likc) 


Uhiquitin-related 


dlwh3a_, dlmg8a_, dlxd3b_, dlwm3a_, dlwiaa_ 

dlv5oa_. dIvSfia ,d]v6L-a .dlwjna .dlwjiLa_ 


4 


C-lype lectin-like 


C-type lectin domain 


dltdqb_,dle87a_, dlkgOc_, dlqo3c_,dlsl4a_ 
dlh8ua_, dltn3_, dljzna_, d2afpa_, dlbyfa_ 


5 


Cytochrome P450 


Cytochrome P450 


dljipa_. dlizoa_. dlx8va_, dliD7a_, dl ipza_ 
dlpo5a_, dllfka_, dln40a_, dln97a_, dlcpt_ 


6 


DNA clamp 


DNA polymerase 

pruLTSsivily faclur 


dlb77al, dlplq_l, dlud9al, dldmla2, dlplq_2 

dliz5a2,dll6]al.dklinlal,dli/5al,dlLi7bal 


7 


Enolase N- terminal 
domain-like 


Enolase N- terminal 
domain-like 


dlec7a2, dlsjda2, dlr0ma2, dlwueaZ, dljpdx2 
dlrvka2, dlmuca2, dljpma2, d2mnr_2, dlyeya2 


8 


Ferredoxin-like 


IIMA, heavy metal-assodated 
domain 


dlfcOa_. dlfvqa_.dlawO_. dlmwya_. dlqupa2 
dlosda_, dlcc8a_, dlsb6a_, dlkqka_, dlcpza_ 


Canonical RBD 


dlno8a_, dlwgla_, dlooOb_, dlfxlal, dlh6kx_ 

dlwg4a_, dlsjqa_, dlwfOa_, dll3ka2, dlwhya_ 


9 


FL-lTilill-liU- 


Fcnilin 


dllWa .dlvcla .dlo9ni .dljsrua .dIvFja 

dlljuj . Jliil i.i . L.II . Jill lj . JlL.iim,i 


10 


Flavodoxin-like 


CheY-related 


dlkrwa_, d lmb3a_, d 1 cjkka,. d 1 bnOa_. d 1 a04a2 
dlw25al, dlw25a2, dloxkb_, dluOsy_, dlp6qa_ 


11 


Globin-like 


Globins 


dlbOb_, dlit2a_, dlx9fc_, dlh97a_, dlqlfa_ 

dlcqxal. dlwmub_, dlirda ., d3sdha . , d lgLva_ 


12 


Glutathione S-transf erase 
(GST), C-terminal domain 


Glutathione S -transferase 
(GST), C-terminal domain 


dloyjal.dleemal, dln2aal, d2gsq_l, dlfZeal 
dlnhyal, dlrSaal, dlmOual, dloe8al, dlk3yal 


13 


ImmunoglobuliQ-like 
beta-sandwich 


Fibronectin type HI 


dluc6a_ dlbqual, dln26a2, d2hft_2, dlaxib2 
dllwra_, dlfyhb2, dlcd9bl, dllqsr2, dlf6fb2 


CI set domains (antibody 
constant domain-like) 


dll6xal, d2fbjh2, dlk5nb_, dlmjuh2, dlfpSal 

dluvqal,dlrzfl2,dlmjul2, d3frual. dlk5nal 


I set domains 


dlgl4b_, dlzxq_2, dliray3, dlbiha3, dlp53a2 
dlev2e2, dlp53a3, dluctal, dlgsmal, dlrhfa2 


14 


LDH C-terminal domain-like 


Lactate & malate dehydrogenases 
C-terminal domain 


dlojua2, dlllda2, d7mdha2, dlt2da2, dlgvla2 
d2cmd_2, dlhyea2, dlez4a2, dlhyha2, dlb8pa2 


15 


NAD(P)-binding 
Rossmann-fold domains 


LDH N-terminal 
domain-like 


dluxjal, d2cmd_l, dlo6zal, dlobbal, dlldnal 
dlt2dal, dlb8pal, dlhyeal, dlhyhal, dls6yal 


Tyrosine-dependent 
oxidoreductases 


dldb3a_, dlsb8a_, dlek6a_, dlxgka_, dlja9a_ 
dli24a_, dlgy8a_, dliy8a_, dlvlOa_, dlw4za_ 


16 


Ntn hydrolase-like 


Proteasome subunits 


dlrypg_, dlrypd_, dlrypl_, d!rypa_, dlrypb_ 
dlq5qa^, dliypk_, dlryph_, dlrypl_, dlrypi_ 


17 


Nuclear receptor 
ligand-binding domain 


Nuclear receptor 
ligand-binding domain 


dlnq7a^, dlpzla_, dlrlkd_, dlt7ra_, dln46a_ 
dlpkSa^, dlxpca_, dlpq9a_, dlpdua_, dlxvpb_ 


18 


P-loop containing nucleoside 
triphosphate hydrolases 


Extended AAA 
ATPase domain 


dlw5sa2, dld2na_, dllv7a_, dlfima2, dlsxje2 
dll8qa2, dlnjfa_, dlsxja2, dlny5a2, dlr7ra3 


G proteins 


dlr8sa_, dlwbla4, dlmkya2, dlkkla3, dlctqa_ 
dlwf3al, dlr2qa_, dli2ma_, dlsvia_, d3raba_ 


19 


PDZ domain-like 


PDZ domain 


dlihja_, dlg9oa_, dlqava_, dlr6ja_, dlm5za_ 
dll6oa_, dlujva_, diiu2a_, dln7ea_, dlgmla_ 


20 


Periplasmic binding 
protein-like I 


L-arabinose binding 
protein-like 


d 1 s>Lga_, d2dn . d 1 jyca_. d 1 guda_. d 1 jdpa_ 

dljx6a_, dlbyka_, dlqoOa_, dSabp , dltjya_ 
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21 


Periplasmic binding 
protein-like II 


Phosphate binding 
protein-like 


dlxvxa_,dllst , dly4ta_, dlamr__. dlursa_ 

dli6aa_, dlpb7a_, dlii5a_, dlsbp , dlalg 


22 


PLP-dependent transferases 


AAT-like 


dlbwOa_, dltoia_, dlw71a_, dlo4sa_, dlm6sa_ 

d 1 uu 1 a_. d 1 v2da_. d 1 u08a_. d 1 k5a_. d 1 gdea_ 


23 


Prolcin kinase- like 
(PK-like) 


Protein kinases 
catalytic subunit 


dltkia_,dls9ja_, dlk2pa_, dlvjya_,dlphk 

dlxkka_, dlrdqe_, dlfvra_, dlu46a_, dluu3a_ 


24 


TIM beta/alpha-barrel 


Beta-glycanases 


d 1 xyza_. d 1 bqea_, d 1 bhga3, d 1 nofa2. d 1 ecea_ 
dlqnra_, dlfoba_, dlhlna_, dluhva2, d7a3ha_ 


Class I aldolase 


dln7ka_, dlw3ia_, dlvlwa_, dlgqna_, dlub3a_ 

dl I6wa_, d 1 o5ka_. d 1 sfia_. dlplxa_, dloj)ca_ 


Table 3: Scop classification of the Proleus_300 set. 


Class name 


SCOP fold 


SCOP family 


Domains name 


A 


7-bladed beta-propeller 


Wr>40-repeat 


dlnrOal. dlnexb2, dlk8kt:_. dlp22a2, dlerja_ 
dltbga_, dlpgua2, dlgxra_, dlpgual, dlnrOa2 


B 


Acyl-CoA N-acyltransferases 

(NAT) 


N-acetyl transferase, 

NAT 


dlnsla_, dlqsta_, dlvhsa_, dls3za_, dln71a_ 

dltiqa_, dk|2ya_, dlghea_. dluflia_. dlvkca_ 


C 


Be la- Grasp 
(ubiquitin-lLke) 


Ubiqui tin-related 


dlwh3a_, dlmg8a_, dlxd3b_, dlwm3a_, dlwiaa_ 
dlv5oa_, dlv86a_, dlv6ea_, dlwjna_, dlwjua_ 


D 


C-type lectin-like 


C-type lectin domain 


d 1 [dqb_. d 1 e87a_. d 1 kgOc_. d 1 qy3c_. d 1 sl4a_ 
dlh8ua_, dltn3 , dljzna_, d2afpa_, dlbyfa_ 


E 


Cytochrome P450 


Cytochrome P450 


dljipa_, dlizoa_, dlx8va_, dlio7a_, dljpza_ 

d 1 po5a_. d 1 lflca_. d 1 n40a_. d 1 n97a_. d 1 L:pl_ 


F 


DNA.'laiup 


DNA polyiiKTu^L' 


dlb77al.dlplq 1 . J I ud'-Jal . J I Jiiild2. J 1 plq 2 
dli/.xi2. dlldljl. Llldiiilal. dill Ixil 


G 


Enolase N-terminal 
domain-like 


Enolase N-terminal 
domain-like 


dlec7a2. dlsida2, d 1 rOLna2, d 1 wiiea2. d Ijpdx2 
dlrvka2, dlmuca2, dljpma2, d2mnr_2, dlyeya2 


H 


Ferredoxin-lilce 


HMA, heavy metal-associated 

domain 


dlfeOa_, dlfvqa_, dlawQ . dlmwya_, dlqupa2 

d 1 ysda_. d 1 cc8a_, d 1 sb6a_, d 1 kqka_. d 1 L:pza_ 


Canonical RBE5 


dliioSa .dlvv^la .dhioOb . J I l\lal . J 1 Ii6k\ 
dl«j;la .dl.-.n.ia .dl\Ulla . d 1 l.-ka2. d 1 « li> a 


I 


Ferritin-like 


Ferritin 


dllb3a_. dlvela_, dlo9ra_, dljgca_, dlvlga_ 
dltjoa_, dlnf4a_, dljiga_, dlji4a_, dlumna_ 


Globin-like 


Globins 


dlbOb_, dlit2a_, dlx9fc_, dlh97a_, dlqlfa_ 

dlcqxal, dlwmub_, dlirda_, d3sdha_, dlgcva_ 


J 


Glutathione S -transferase 
(GST), C-terminal domain 


Glutathione S-transf erase 
(GST), C-terminal domain 


dloyjal, dleemal, dln2aal, d2gsq_l, dlf2eal 
dlnhyal, dlrSaal, dlmOual, dloeSal, dlk3yal 


K 


Immunoglobulin-like 
beta-sandwich 


Fibronectin type HI 


dluc6a_, dlbqual, dln26a2, d2hri_2,dlaxib2 
dllwra_, dlfyhb2, dlcd9bl, dllqsr2, dlf6fb2 


CI set domains (antibody 
constant domain-like) 


dll6xal, d2fbjh2, dlk5nb_, dlmjuh2, dlfp5al 

dluvqaLdlrzfl2.dlm.iul2.d3fmaLdlk5nal 


I set domains 


dlgl4b_, dlzxq_2, dliray3, dlbiha3, dlp53a2 
dlev2e2, dlp53a3, dluctal, dlgsmal, dlrhfa2 


L 


LDH C-terminal domain-Like 


Lactate & malate dehydrogenases 
C-terminal domain 


dlojua2, dlllda2, d7mdha2, dlt2da2, dlgvla2 
d2cmd_2, dlhyea2, dlez4a2, dlhyha2, dlb8pa2 


M 


NAD(P)-binding 
Rossmami-fold domains 


LDH N-terminal 
domain-like 


dluxjal, d2cmd_l, dlo6zal, dlobbal,dlldnal 
dlt2dal, dlbSpal, dlhyeal, dlhyhal, dls6yal 
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N 


NAD(P)-bin(ling 
Rossmann-f old domains 


Tyrosine-dependcnl 
oxidoreductases 


dldb?a_. dlsb8a_. dlt:k6a .dUgka .dlja9a_ 
dli24a_,dlgy8a_,dliy8a_, dlvlOa_,dlw4za_ 


O 


Ntn hydrolase-like 


Proteasome subunits 


dlrypg_, dlrypd_, dlrypl_, dlrypa_, dlrypb_ 

dlq5qa_. dlrypk_, d 1 ryph_. dlrypl_, dlrypi_ 


P 


Nuciciu ri;ci;plor 
ligand-binding domain 


Nuclear recepior 
hgand-binding domain 


dlnq7a_, dlpzla_, dlrlkd_,dlt7ra_, dln46a_ 
dlpk5a_, dlxpca_, dlpq9a_, dlpdua_, dlxvpb_ 


Q 


PDZ domain-like 


PDZ domain 


dlihja_, dlg9oa .dlqava .dlr6j:i ,dlm5za_ 
dll6oa_, dlujva_, dliu2a_, dln7ca_, dlgmla_ 


R 


Periplasmic binding 

prolixin- liki: I 


L-arabinose binding 

protcin-liko 


dlsxga_, d2dri_, dljyea_, dlguda_, dljdpa_ 

d 1 jx6a_. d 1 byka_. d 1 qDOa_, d8abp_, d lljya_ 


S 


Periplasmic binding 
protein-liken 


Phosphate binding 
protein-like 


dlxvxa_, dllst ■dly4ta .dlamf . dlursa_ 

dli6aa_, dlpb7a_, dlii5a_, dlsbp_, dlatg_ 


T 


PLP-dependent transferases 


AAT-like 


d 1 bwOa_. d 1 tDia_. d 1 w71a_. d 1 y4sa_. d 1 m6sa_ 
dluula_, dlv2da_, dlu08a_,dllc5a_, dlgdea_ 


U 


Protein kinase-like 

(PK-lilct:) 


Protein kinases 
catalytic subunit 


dltkia_, dls9ja_, dlk2pa_, dlvjya_, dlphk_ 

d 1 xkka_. d 1 rdqc_. d 1 fvra_. d 1 ii46a_, d luu3a_ 


V 


TIM beta/alpha- barrel 


Beta-glycanases 


dlxyza_, dlbqca_, dlbhga3, dlnofa2, dlecea_ 
dlqnra_, dlfoba_, dlhlna_, dluhva2, d7a3ha_ 


w 


HM beta/alpha-barrel 


Class I aldolase 


dln7ka_. dlw3ia_. dl vlwa_. dlgqna_, dlub3a_ 
dll6wa_, dlo5ka_, dlsfla_, dlplxa_, dlojxa_ 


X 


P-loop containing nucleoside 
triphosphate hydrolases 


Extended AAA 
ATPase domain 


dlw5sa2, dld2na_, dllv7a_, dlfima2, dlsxje2 
dll8qa2, dlnjfa_, dlsxja2, dlny5a2, dlr7ra3 


Y 


P-loop containing nucleoside 
triphosphate hydrolases 


G proteins 


dlr8sa_, dlwbla4, dln[ikya2, dlkkla3, dlctqa_ 
dlwfBal, dlr2qa_, dli2ma_, dlsvia_, d3raba_ 


Havodoxin-like 


CheY-related 


dlkrwa_, dlmbSa^, dlqkka^, dlbOOa^, dla04a2 
dlw25al, dlw25a2, dloxkb_, dluOsy_, dlp6qa_ 



Table 4: Relative gap based classification of the Proteus_300 set. Col- 
umn 2 and 3 present the SCOP classification of the elements inside each 
classes 
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Figure 6: Comparing the quality of lower and upper bounds on the set of unsolved 
instances. a_purva clearly outperforms LR on the quality of its bounds. 
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